EPA-1316 Introduction to Urban Data Science

Assignment 2: Geographic Visualisation¶


Instructions¶

This assignment puts together what you learned in Weeks 4-5. Assignment 3 will build upon what you do in Assignment 2.

Note: Go through labs and homeworks 03-04 before starting this assignment.

1.1 Submission¶

Please submit your results on Brightspace under Assignment 02 as a single file named as in the example:

firstname_secondname_thirdname_lastname_02.html

If your file is not named in lowercase letters as shown above, your assignment will not be read by the script that compiles 200 assignments and you will miss out on the grades. I don't want that, so be exceptionally careful to name it properly. Don't worry if you spell your name slightly incorrectly; I mainly want to avoid a situation where I have 200 assignments all called assignment_02.html

Please do not submit any data or files other than the html file.

Don't worry if your submission renders without images after you upload it to Brightspace. That is a Brightspace issue with viewing your own submission; when I download all assignments as a batch, I get all your images and code as you intended to submit them. So make sure that your HTML shows everything you want us to see before you submit.

1.2 How do you convert to HTML?¶

There are two ways:

  1. From a running notebook, you can convert it into HTML via the File tab on the main menu of Jupyter Lab
    • File → Export Notebooks as... → Export Notebook to HTML
  2. From a terminal or command line, run
    • jupyter nbconvert --to html <notebook_name>.ipynb

1.3 Learning Objectives¶

This assignment is designed to support three different learning objectives. After completing the following exercises you will be able to:

  • Combine different datasets
  • Explore and Visualise Geographic data
  • Plot (graphs, scatter plots, choropleth, etc..) and discuss (observations, outliers or relationships) important information about the data using the principles of graphical excellence and guidelines of exploratory data analysis.


Problem¶

Problem Statement:

  • For this assignment you will use two different datasets of The Hague, The Netherlands to formulate a hypothesis/RQ.
  • To formulate the hypothesis, provide at least two measurements that may be related to each other (for example: your hypothesis is that neighbourhoods with no green space are more prone to populations with mental health issues).
  • Be explicit about how you define these measurements using markdown cells (for example: how do you measure the amount of green space, and how do you measure the levels of mental health issues in the population?).
  • Observe that the measurements have a normative value attached to them (for example: according to your hypothesis, high levels of crime in a neighbourhood are of more interest). Please do not assume that there is only one normative definition of a certain measurement, and do not skip your reasoning.
  • On the basis of the hypothesis and its associated measurements, you will conduct some exploratory/spatial data analysis and provide a list of five neighbourhoods in The Hague, The Netherlands characterised as areas of interest.

Note: I am not looking for mathematical equations as justification, but you are welcome to also form simple relations and show them in markdown.

A successful student example from last year,

I hypothesise that neighbourhoods of The Hague with higher average incomes and less income inequality are likely to have fewer reported crimes, due to lower levels of socioeconomic disadvantage.

Definitions of metrics:

  • Average income (average gross personal income per resident, euros)
  • Inequality (determined in this analysis based on the percentage of households in a neighbourhood with average income - higher percentage of average income = less inequality). This is arguably an oversimplification of the concept of inequality but is considered a suitable approximation given the available datasets.
  • Reported crime - all offences 2019 (normalised by population; note this produces some anomalous values for neighbourhoods with very low populations, e.g. industrial estates or parks, some of which have been removed)
  • Population - population of each neighbourhood - used to adjust the crime statistics for population.
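The per-population normalisation described in this example can be sketched as follows. This is a toy frame with invented values and illustrative column names, not the actual CBS data:

```python
import pandas as pd

# Toy neighbourhood data (illustrative values, not real figures)
df = pd.DataFrame({
    "neighbourhood": ["A", "B"],
    "crimes_2019": [120, 30],
    "population": [8000, 1500],
})
# Normalise crime counts by population so neighbourhoods of different
# sizes become comparable
df["crimes_per_1000"] = df["crimes_2019"] / df["population"] * 1000
print(df["crimes_per_1000"].tolist())  # [15.0, 20.0]
```

Note how the smaller neighbourhood B ends up with the higher per-capita rate despite fewer absolute offences; this is exactly why very low populations produce anomalous values.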


Data¶

As part of the assignment files, I have provided you with the first dataset: the shapefiles of The Hague. After you unzip the data, we’ll work with the files of the form neighborhoods.xxx, in all the different geographic formats. Put the data in a convenient location on your computer, ideally in a folder called data next to this Jupyter notebook. I recommend taking a look at the .json file in a text editor like Atom (any system) or Notepad++ (Windows); such editors will also make your life easier for everything else on your computer. Make sure you’ve set your working directory correctly.

It’s a big file and it may take a while to load onto your laptop and into Python (running on the jupyter labs environment).

As mentioned above in the problem introduction, you will use at least two datasets.

  1. First Dataset: Download Shapefiles provided with the assignment
  2. Second Dataset: Get a second dataset of your choice from The Hague city region using the links below (curate them as you like)

More Data Sources¶

You can find more data sources on Cities and Population, Climate indicators and Land-use in the following links.

Mikhail Sirenko has prepared the first dataset shapefiles with love and care so that you can connect it with either Den Haag Cijfers or CBS datasets without having to clean any data. For the rest of the datasets, you may need to clean up.

Note: Data from CBS is only complete up to 2017. You will have to subset the data on municipality using the variable gm_naam = 's-Gravenhage and then subset on neighbourhood resolution using the variable recs = Buurt to get data that matches the shapefiles we have provided.
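The two subsetting steps in the note can be sketched like this. Only the column names (gm_naam, recs) come from the note; the miniature table itself is invented for illustration:

```python
import pandas as pd

# Hypothetical miniature of the CBS table (the real kwb file has
# ~17,000 rows and 119 columns)
cbs = pd.DataFrame({
    "gm_naam": ["'s-Gravenhage", "'s-Gravenhage", "Amsterdam"],
    "recs": ["Buurt", "Wijk", "Buurt"],
    "regio": ["Belgisch Park", "Scheveningen", "Jordaan"],
})
# Keep only The Hague rows at Buurt (neighbourhood) resolution
den_haag = cbs[(cbs["gm_naam"] == "'s-Gravenhage") & (cbs["recs"] == "Buurt")]
print(den_haag["regio"].tolist())  # ['Belgisch Park']
```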

  • https://denhaag.incijfers.nl/jive (available in English on the website)
  • https://www.cbs.nl/nl-nl/reeksen/kerncijfers-wijken-en-buurten-2004-2020
  • http://citypopulation.de/
  • https://www.census.gov/programs-surveys/geography.html
  • https://www.eea.europa.eu/data-and-maps
  • http://download.geofabrik.de/

In case you get more data as shapefiles, and want to play with projections, a nice guide for it is here¶



Tasks¶

For your convenience, the assignment has been divided into the following tasks,

  1. Use two datasets: merge 1 shapefile (already provided) and a csv file (you find and obtain)
  2. Formulate a hypothesis and the measurements you are going to use
  3. Clean your data and make it tabular for your own good! (think about weeks 1-2 and assignment 1)
  4. Carry out an exploratory data analysis (EDA)
  5. Report and justify your choice of the list of 5 neighbourhoods on the basis of your analysis.
    • Use at least 3 figures to support your analysis. Think about exploratory data analysis (build data, clean data, explore global/group properties).
    • These figures should follow the principles of graphical excellence. Using markdown, write explicitly under each figure at least 3 principles of excellence you have used to make it.
    • Create choropleths to display region-specific information (ex. population, voting choice or jobs availability) together with some other elements like the sea, canals, centroids, or amenities (you may try Open Street Maps data - using osmnx).
    • Be careful with the use of color (and try to be inclusive of color blind people)
    • Use one method from the lectures to discuss what you observe for your variable(s). Examples:
      • local or global spatial autocorrelation
      • network measures
      • spatial weights / spatial lag
      • binning
      • feature scaling, normalisation or standardisation
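As an illustration of one of these methods, a spatial lag can be computed by hand with a row-standardised adjacency matrix. This sketch uses a made-up three-region line (0–1–2) rather than any real neighbourhood data:

```python
import numpy as np

# Adjacency of three regions in a line: 0-1, 1-2 (toy example)
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
# Row-standardise so each row sums to 1
W = W / W.sum(axis=1, keepdims=True)
y = np.array([10.0, 20.0, 60.0])
# Spatial lag: each region's value replaced by its neighbours' mean
lag = W @ y
print(lag)  # [20. 35. 20.]
```

Libraries such as pysal build the same kind of weights matrix directly from geometries, which is the approach used later in this notebook.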

*Remember to always document your code! Justify everything you do (cleaning data, analysing data, exploring data, defining hypotheses or measurements, etc.) using markdown cells as you go through the notebook.*



Start your analysis¶


Analyzing The Hague

1. Import Libraries and Data

To start off, I import the libraries I will be using for this assignment. These are:

In [ ]:
#Import Geopandas for geo-visualization
import geopandas as gpd
#Import Pandas for data analysis
import pandas as pd
#Import Seaborn for data visualization
import seaborn as sns
#Import Contextily for basemaps including OpenStreetMap
import contextily as cx
#Import Matplotlib for data visualization
import matplotlib.pyplot as plt
#Import weights for spatial analysis
from pysal.lib import weights
#Import numpy for array operations
import numpy as np
#Import esda for spatial autocorrelation
import esda

Afterwards, I import the neighborhoods that I want to analyze from the provided shapefile.

In [ ]:
 #I use the geopandas read_file function to read the shapefile.
den_haag_neighborhoods = gpd.read_file("data/neighborhoods.shp")
# I check the data by printing the first 5 rows of the data.
den_haag_neighborhoods.head()
Out[ ]:
neighb_cbs neigb_cijf geometry
0 Oostduinen 70 Oostduinen POLYGON ((4.30290 52.12832, 4.30298 52.12827, ...
1 Belgisch Park 71 Belgisch Park POLYGON ((4.28056 52.11706, 4.28053 52.11706, ...
2 Westbroekpark 73 Westbroekpark POLYGON ((4.29171 52.10745, 4.29181 52.10739, ...
3 Duttendel 74 Duttendel POLYGON ((4.32252 52.10768, 4.32252 52.10766, ...
4 Nassaubuurt 48 Nassaubuurt POLYGON ((4.31943 52.09247, 4.31943 52.09247, ...

I see that my shapefile provides me with the neighbourhoods of The Hague, along with their geometry, which is important for plotting the data later on. I can identify the different neighborhoods via the columns "neighb_cbs" or "neighb_cijf", which facilitates joining with data from CBS or Den Haag Cijfers.

I want to get a quick overview of the spatial data.

In [ ]:
# use plot() to display shapefiles
den_haag_neighborhoods.plot()
Out[ ]:

I can easily identify the shape of The Hague, the different neighborhoods, and the latitude/longitude extent of the city.


2. Problem Statement

In the following chapter, I will explain my hypothesis and the measurements I will use to test my hypothesis.

I hypothesize that neighbourhoods with a high share of people with low education and a high population density tend to have a lower real estate value, as their residents are more likely to have a lower income and live in multi-family homes. I will use data provided by CBS at neighborhood level to test this hypothesis. The data was downloaded from CBS. I will use the following measurements:

| Variable Name                       | Variable Code |
| ----------------------------------- | ------------- |
| number of inhabitants               | a_inw         |
| real estate value [x 1,000 €]       | g_woz         |
| number of people with low education | a_opl_lg      |
| population density [#/km^2]         | bev_dich      |

Furthermore, I will create the measure "people with low education [%]", which is the number of people with low education divided by the population.


3. Dataset

3.1 Loading data

As the website provides the data in a .xls file, I will use pandas read_excel method to load the data and create a dataframe saved in the variable csb_data. After loading the data, I will check the first 5 rows of the data to see if the data was loaded correctly using the head() method.
In [ ]:
#Load data using Pandas read_excel function
csb_data = pd.read_excel("data/kwb-2020.xls")
#Use info() to check the data import
csb_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 17341 entries, 0 to 17340
Columns: 119 entries, gwb_code_10 to ste_oad
dtypes: float64(33), int64(37), object(49)
memory usage: 15.7+ MB

I see the data loaded; very nice. I notice that each neighborhood appears twice in the dataset, as there is a row for Buurt and for Wijk, respectively. Furthermore, I notice that the dataset also includes many other cities in the Netherlands that are not of interest for this analysis. Hence, I will subset the data to only include the data for The Hague.

In [ ]:
#By selecting the rows where the column gm_naam is equal to 's-Gravenhage' and remove the duplicates by only choosing the "Buurt" rows
csb_data_den_haag = csb_data[(csb_data["gm_naam"] == "'s-Gravenhage") & (csb_data["recs"] == "Buurt")]
#Afterwards, I will check the first 5 rows of the data to see if the data was loaded correctly using the head() method.
csb_data_den_haag.head()
Out[ ]:
gwb_code_10 gwb_code_8 regio gm_naam recs gwb_code ind_wbi a_inw a_man a_vrouw ... g_afs_kv g_afs_sc g_3km_sc a_opp_ha a_lan_ha a_wat_ha pst_mvp pst_dekp ste_mvs ste_oad
7766 BU05180170 5180170 Oostduinen 's-Gravenhage Buurt BU05180170 1 0 0 0 ... . . . 318 308 10 2597 5 4 683
7768 BU05180271 5180271 Belgisch Park 's-Gravenhage Buurt BU05180271 1 8350 4095 4250 ... 0,4 0,5 10,9 106 106 0 2587 2 2 2497
7770 BU05180373 5180373 Westbroekpark 's-Gravenhage Buurt BU05180373 1 1070 475 590 ... 0,7 0,7 18,5 46 41 5 2597 4 2 2402
7771 BU05180374 5180374 Duttendel 's-Gravenhage Buurt BU05180374 1 1100 490 610 ... 0,4 0,4 14,3 133 130 3 2597 1 3 1135
7773 BU05180448 5180448 Nassaubuurt 's-Gravenhage Buurt BU05180448 1 1620 790 830 ... 0,2 0,4 22,0 29 28 0 2596 1 1 3307

5 rows × 119 columns

I see that my dataset now seems to only include data for The Hague on a Buurt level. I now want to have a quick overview of the size of the dataset to get a feeling for the amount of data I am working with.

In [ ]:
# display the number of rows and columns
csb_data_den_haag.shape
Out[ ]:
(114, 119)

I observe that the dataset has 114 rows and 119 columns. In the following I want to subset this data to only include the columns I am interested in. I will do this by selecting the columns "regio", "a_inw", "g_woz", "a_opl_lg" and "bev_dich". I will save this subset in the variable csb_data_den_haag_measures. To facilitate the subset generation I save the column names in a list called measures. Furthermore, I create a measure_dictionary that will be used later on to rename the columns.

In [ ]:
# List of columns to keep
measures = ['a_inw','g_woz','a_opl_lg','bev_dich','regio']
#dictionary of columns to rename
measure_dictionary = {
    'a_inw':'Amount of Inhabitants'
    ,'g_woz':'Real Estate Value [x 1,000 €]',
    'a_opl_lg' : 'Amount of people with low education',
    'bev_dich':'Population density [#/km^2]',
    'people with low education [%]' : 'people with low education [%]'
}

With these variables it is now easy to create the adequate subdataset I will use in my analysis.

In [ ]:
#Creating subset with the measure list
csb_data_den_haag_measures = csb_data_den_haag[measures]
#Displaying overview of new subdataset
csb_data_den_haag_measures.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 114 entries, 7766 to 7922
Data columns (total 5 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   a_inw     114 non-null    int64 
 1   g_woz     114 non-null    object
 2   a_opl_lg  114 non-null    object
 3   bev_dich  114 non-null    object
 4   regio     114 non-null    object
dtypes: int64(1), object(4)
memory usage: 5.3+ KB

I see that I still have 114 rows but only my selected columns.

3.2 Data Cleaning

My new subset only includes the columns I am interested in. In the following I will clean the data. Looking again at the first 5 rows, I notice that the neighborhood Oostduinen is missing some values; this is unsurprising, as its population is also 0. I assume that the dataset creator used "." to indicate missing values.

In [ ]:
#Display the first 5 rows of the data
csb_data_den_haag_measures.head()
Out[ ]:
a_inw g_woz a_opl_lg bev_dich regio
7766 0 . . . Oostduinen
7768 8350 398 1120 7880 Belgisch Park
7770 1070 . 60 2591 Westbroekpark
7771 1100 612 90 844 Duttendel
7773 1620 653 80 5737 Nassaubuurt
In [ ]:
#Set index to name of the neighborhood
csb_data_den_haag_measures.set_index('regio', inplace=True)
#Display the first 5 rows of the data
csb_data_den_haag_measures.head()
Out[ ]:
a_inw g_woz a_opl_lg bev_dich
regio
Oostduinen 0 . . .
Belgisch Park 8350 398 1120 7880
Westbroekpark 1070 . 60 2591
Duttendel 1100 612 90 844
Nassaubuurt 1620 653 80 5737

I see that the index is now the name of the neighborhood, which will facilitate further analysis. I will now check if there are any missing values in the dataset.

I notice that some values are missing and are replaced by ".". I want to check if there are more neighborhoods with "." values to see how many rows I have to clean.

In [ ]:
#Display all rows where the column "a_opl_lg" is equal to '.'
csb_data_den_haag_measures[(csb_data_den_haag_measures['a_opl_lg'] == '.')]
Out[ ]:
a_inw g_woz a_opl_lg bev_dich
regio
Oostduinen 0 . . .
Kerketuinen en Zichtenburg 105 106 . 133
Vliegeniersbuurt 0 . . .
De Reef 170 324 . 336
Tedingerbuurt 0 . . .

I will drop the neighborhoods Oostduinen, Vliegeniersbuurt, and Tedingerbuurt, as nobody lives there. This will change the weights in my later spatial weights analysis; however, it would be wrong to include them, as I am analyzing the effect of the percentage of people with low education and hence need citizens in the neighborhoods.

In [ ]:
#drop neighborhoods with no inhabitants
csb_data_den_haag_measures = csb_data_den_haag_measures[csb_data_den_haag_measures['a_inw']!=0]
#check for remaining rows with missing values
csb_data_den_haag_measures[(csb_data_den_haag_measures['a_opl_lg'] == '.')|(csb_data_den_haag_measures['g_woz'] == '.')|(csb_data_den_haag_measures['bev_dich'] == '.')]
Out[ ]:
a_inw g_woz a_opl_lg bev_dich
regio
Westbroekpark 1070 . 60 2591
Kerketuinen en Zichtenburg 105 106 . 133
De Reef 170 324 . 336
De Rivieren 35 . 0 24

As the measures I am interested in are not sharply differentiated at the borders of a neighborhood, I decide to use the average of the adjacent neighborhoods as a measure for the neighborhood. For example, schools can serve multiple neighborhoods, so their effect on the percentage of people with low education will also show up in multiple neighborhoods. I checked datasets of different years for the variables but did not find any data that would argue against this chosen approach.

In [ ]:
#Create matrix of neighborhood weights with Queen Approach
w_queen = weights.Queen.from_dataframe(den_haag_neighborhoods,idVariable='neighb_cbs')

I will set the number of people with low education to 0 for these neighborhoods, but will keep in mind that I have to adapt their value for the percentage of people with low education later. This enables me to calculate the percentage for all neighborhoods in the first place.

In [ ]:
# Setting amount of people with low education to 0
csb_data_den_haag_measures.at['Kerketuinen en Zichtenburg','a_opl_lg'] = 0
csb_data_den_haag_measures.at['De Reef','a_opl_lg'] = 0
In [ ]:
def get_neighborhood_area_average(data,neighborhood,neighbors,measure):
    '''
    This function calculates the average of the measure for the adjacent neighbors of the neighborhood
    Parameters:
        data: the dataset containing the measure
        neighborhood: the name of the neighborhood
        neighbors: the list of neighbors of the neighborhood
        measure: the name of the measure
    Returns:
        The average of the measure for the adjacent neighbors of the neighborhood
    '''
    if measure =='a_opl_lg':
        #Return average low-education percentage of the neighbors
        return (data.loc[data.index.isin(neighbors),'a_opl_lg'].astype(float)/ data.loc[data.index.isin(neighbors),'a_inw'].astype(int) * 100).mean()
    else:
        # use mean() to calculate the average of the with .loc[] selected subset    
        return data.loc[data.index.isin(neighbors),measure].astype(int).mean()

I choose to identify neighborhoods with the Queen approach as streets run in every possible direction between neighborhoods. Thus, including neighbors at the corners will increase the quality of the average in further analysis.

In [ ]:
csb_data_den_haag_measures.at['Westbroekpark','g_woz']  = get_neighborhood_area_average(csb_data_den_haag_measures,'Westbroekpark',list(w_queen['Westbroekpark']),'g_woz')
csb_data_den_haag_measures[(csb_data_den_haag_measures['a_opl_lg'] == '.')|(csb_data_den_haag_measures['g_woz'] == '.')|(csb_data_den_haag_measures['bev_dich'] == '.')]
Out[ ]:
a_inw g_woz a_opl_lg bev_dich
regio
De Rivieren 35 . 0 24

I see that Westbroekpark now has a value for real estate value.

In [ ]:
csb_data_den_haag_measures.at['De Rivieren','g_woz']  = get_neighborhood_area_average(csb_data_den_haag_measures,'De Rivieren',list(w_queen['De Rivieren']),'g_woz')
csb_data_den_haag_measures[(csb_data_den_haag_measures['a_opl_lg'] == '.')|(csb_data_den_haag_measures['g_woz'] == '.')|(csb_data_den_haag_measures['bev_dich'] == '.')]
Out[ ]:
a_inw g_woz a_opl_lg bev_dich
regio

I see that my operation was successful and there are no missing values left. Furthermore, I check the data types of the columns.
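The dtype check can be sketched on a toy stand-in for the subset (invented values; in the notebook it would simply be csb_data_den_haag_measures.dtypes). Values read from the .xls arrive as strings because of the "." placeholders, so pandas stores them as object columns:

```python
import pandas as pd

# Toy stand-in: numeric-looking values stored as strings
toy = pd.DataFrame({"g_woz": ["398", "612"], "bev_dich": ["7880", "844"]})
print(toy.dtypes)              # both columns: object
toy_float = toy.astype(float)  # works once "." placeholders are handled
print(toy_float.dtypes)        # both columns: float64
```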

I see that the data type of the columns is object. I want to change this to float.

In [ ]:
#To change the data type i use the astype() function on the respective columns
csb_data_den_haag_measures['a_opl_lg'] = csb_data_den_haag_measures['a_opl_lg'].astype(float)
csb_data_den_haag_measures['g_woz'] = csb_data_den_haag_measures['g_woz'].astype(float)
csb_data_den_haag_measures['bev_dich'] = csb_data_den_haag_measures['bev_dich'].astype(float)

As described in the problem statement, I want to create the measure "people with low education [%]", which is the number of people with low education divided by the population.

In [ ]:
# Calculate Percentage of Low Education
csb_data_den_haag_measures['people with low education [%]'] = csb_data_den_haag_measures['a_opl_lg']/ csb_data_den_haag_measures['a_inw'] * 100

I now have to change the values for Kerketuinen en Zichtenburg and De Reef again as described above, as I have to use the average of the adjacent neighborhoods.

In [ ]:
# Calculate Percentage of Low Education people with the average of their adjacent neighborhoods for Kerketuinen en Zichtenburg and De Reef
csb_data_den_haag_measures.at['Kerketuinen en Zichtenburg','people with low education [%]']  = get_neighborhood_area_average(csb_data_den_haag_measures,'Kerketuinen en Zichtenburg',list(w_queen['Kerketuinen en Zichtenburg']),'a_opl_lg')
csb_data_den_haag_measures.at['De Reef','people with low education [%]']  = get_neighborhood_area_average(csb_data_den_haag_measures,'De Reef',list(w_queen['De Reef']),'a_opl_lg')

I see that no region with a "." value is left, and I therefore have a cleaned dataframe for further analysis.

With my cleaned dataset, I want to make it more usable and thus, rename the measure columns to easily identifiable names.

In [ ]:
# I do this by using the rename() method on the dataframe. 
# I use the measure_dictionary to rename the columns. 
# I set inplace=True to change the dataframe directly.
csb_data_den_haag_measures.rename(columns=measure_dictionary, inplace=True)
csb_data_den_haag_measures.head()
Out[ ]:
Amount of Inhabitants Real Estate Value [x 1,000 €] Amount of people with low education Population density [#/km^2] people with low education [%]
regio
Belgisch Park 8350 398.0 1120.0 7880.0 13.413174
Westbroekpark 1070 472.4 60.0 2591.0 5.607477
Duttendel 1100 612.0 90.0 844.0 8.181818
Nassaubuurt 1620 653.0 80.0 5737.0 4.938272
Uilennest 2190 380.0 120.0 7439.0 5.479452

With the last output I can confirm that all columns are now properly named and I can move on to the Exploratory Data Analysis.


4. Exploratory Data Analysis

4.1 Distribution

In the following I will explore the data by creating plots and using statistical methods to get a better understanding of it. To facilitate the plotting, I will create a function called graph_distribution that plots the data for a given measure. The function takes the following arguments:

In [ ]:
def graph_distribution(data,measure,axis):  
    '''
    This function plots the distribution of the measure
    Parameters:
        data: the dataset containing the measure
        measure: the name of the measure
        axis: the axis to plot the distribution
    Returns:
        None
    '''
    #deactivate grid
    axis.grid(False)    
    #remove the top and right borders
    sns.despine(top=True,right=True)
    #plot the distribution over 20 bins with a kernel density estimate
    sns.histplot(data[measure], kde=True, bins=20,ax=axis)

I will use this function to plot the distribution of the measures. I will iterate over a list of the measure names and plot the distribution for each measure on its own axis.

In [ ]:
# Create 3 subplots and define size of figure
figure, axes = plt.subplots(1, 3, figsize=(15, 5))
# Define measures to plot
measures_to_graph = ['Real Estate Value [x 1,000 €]','people with low education [%]','Population density [#/km^2]']
# Loop over the measures and axes and plot them, ravel returns a flattened array for the loop
# zip() function returns an iterator of tuples based on the iterable object
for measure,ax in zip(measures_to_graph,axes.ravel()):
    #plot the distribution using graph_distribution()   
    graph_distribution(csb_data_den_haag_measures,measure,ax)

This graph meets the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying the distribution of the data
  • makes efficient use of space by displaying three graphs next to each other

With these distribution charts I can see that the real estate value is heavily right-skewed: the majority of the neighborhoods have a relatively low real estate value, with a long upper tail covering a wide range of values. The percentage of people with low education is more normally distributed; interestingly, it has a saddle point at circa 12% in the kernel density estimate. I would assume that there is a certain area where multiple neighborhoods have a similar percentage, leading to this distribution. The population density is the most normally distributed of the three measures, as can be seen in the KDE. What I notice here is the high count in the first bin. This might be due to how Den Haag's neighborhoods are drawn, as some consist mainly of a park (e.g. Zuiderpark) and just a few residential houses, leading to a low population density. Furthermore, the range of values is large, as some neighborhoods have a population density of over 20,000 citizens/km^2. With the number of bins fixed at 20, this produces a large first bin that "collects" many sparsely populated areas.

In [ ]:
#use describe() on dataframe column to get descriptive statistics
csb_data_den_haag_measures['Real Estate Value [x 1,000 €]'].describe()
Out[ ]:
count    111.000000
mean     305.100129
std      156.067976
min      106.000000
25%      181.500000
50%      284.000000
75%      375.000000
max      740.000000
Name: Real Estate Value [x 1,000 €], dtype: float64

My reading of the distribution graphs is supported by the descriptive statistics. For the real estate value I see that the maximum (740) is almost 3 standard deviations above the mean (305), while more than 75% of the data lie within about 2 standard deviations of the minimum (106). This emphasizes the right skew and the unequal distribution of real estate value.
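The sigma distance quoted above can be verified directly from the describe() output (plain arithmetic on the printed summary statistics, no further assumptions):

```python
# Summary statistics from describe() above
mean, std, maximum = 305.100129, 156.067976, 740.0
# How many standard deviations the maximum lies above the mean
z_max = (maximum - mean) / std
print(round(z_max, 2))  # 2.79
```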

In [ ]:
#use describe() on dataframe column to get descriptive statistics
csb_data_den_haag_measures['people with low education [%]'].describe()
Out[ ]:
count    111.000000
mean      21.043396
std       10.914788
min        0.000000
25%       11.637259
50%       20.689655
75%       29.215544
max       48.484848
Name: people with low education [%], dtype: float64

For the percentage of people with low education, I notice that there is at least one neighborhood with a value of 0% while the mean is at 21%. 50% of the values lie within 1 standard deviation (±10 percentage points) of the mean. The maximum of 48% is about 2.5 standard deviations above the mean and looks like a clear outlier.

In [ ]:
##use describe() on dataframe column to get descriptive statistics
csb_data_den_haag_measures['Population density [#/km^2]'].describe()
Out[ ]:
count      111.000000
mean      9193.765766
std       5938.341561
min         24.000000
25%       5049.500000
50%       8772.000000
75%      12548.500000
max      22381.000000
Name: Population density [#/km^2], dtype: float64

For the population density I see an extreme range of values: the minimum is 24 inhabitants/km^2, while the maximum is 22,381. The mean is 9,194 inhabitants/km^2.

In [ ]:
def graph_scatterplot(data,measure1,measure2,axis):
    """creates a scatterplot using the provided dataframe and x and y measures
    
    Keyword arguments:
        data -- the dataframe to use
        measure1 -- the x measure name
        measure2 -- the y measure name
        axis -- the axis to plot on

    Return: 
        None
    """
    
    #deactivate grid
    axis.grid(False)
    
    #remove the top and right borders
    sns.despine(top=True,right=True)
    #graph scatterplot with provided arguments
    sns.scatterplot(x = data[measure1], y=data[measure2],ax=axis)

In the following I want to use this function to create three scatterplots next to each other, displaying the relationship of the real estate value with the percentage of people with low education and with the population density, respectively. Furthermore, as stated previously, I want to analyze the relationship between the percentage of people with low education and the population density.

In [ ]:
# Create Subplots with matplot
f,ax = plt.subplots(1,3,figsize=(15,5))
#graph the scatterplots using the graph_scatterplot function and axes of subplots
graph_scatterplot(csb_data_den_haag_measures,'people with low education [%]','Real Estate Value [x 1,000 €]',ax[0]) 
graph_scatterplot(csb_data_den_haag_measures,'Population density [#/km^2]','Real Estate Value [x 1,000 €]',ax[1])
graph_scatterplot(csb_data_den_haag_measures,'Population density [#/km^2]','people with low education [%]',ax[2])

This graph meets the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying the relationships between the measures
  • makes efficient use of space by displaying three graphs next to each other

In the first diagram I notice the minimum of the real estate value at about 100. I also notice an outlier with more than 30% of people with low education but a real estate value above 700. Overall, the relationship is only weakly linear and is better approximated by a shifted exponential function with a negative exponent. For population density and real estate value I see a more linear relationship, as the real estate value decreases with increasing population density. In the third graph I see a linear relationship between the percentage of people with low education and the population density; hence, the two factors might influence each other. However, I also see two outliers with a low-education share of almost 50% but a low population density. I will keep an eye out for these neighborhoods in the following analysis, as they might be of interest.

4.2 Correlation

I want to analyze the correlation between the measures to strengthen my hypothesis and also to identify the relationship between my two X variables, the percentage of people with low education and the population density, as their influence on each other is relevant for any further statistical analysis, e.g., linear regression.

In [ ]:
#Use heatmap to show correlation between measures
sns.heatmap(csb_data_den_haag_measures[measures_to_graph].corr(), annot=True)
Out[ ]:

This graph checks the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying the correlation of the data
  • makes efficient use of space by displaying a correlation matrix

To put my correlation analysis into numbers, I inspect the correlation matrix of the dataframe. The correlation between the real estate value and the percentage of people with low education is -0.65, a moderate negative correlation. The correlation between the real estate value and the population density is -0.59, also a moderate negative correlation. The correlation between the percentage of people with low education and the population density is 0.45, a moderate positive correlation. These numbers support my initial hypothesis, so I will further analyze the dataset along these lines. More graphs that only focus on statistical measures are not necessary at this point, and thus I move on to analyzing the data on a map.
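The same Pearson coefficients can be read off directly from `DataFrame.corr()` instead of the heatmap. A minimal, self-contained sketch with synthetic columns (the column names echo the measures, but all values are invented):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
density = rng.uniform(1_000, 15_000, 100)
# Construct columns that are positively / negatively related to density
low_edu = 0.002 * density + rng.normal(0, 5, 100)       # positive relation
value = 400 - 0.015 * density + rng.normal(0, 20, 100)  # negative relation

df = pd.DataFrame({'Population density': density,
                   'low education [%]': low_edu,
                   'Real Estate Value': value})
corr = df.corr()  # pairwise Pearson correlations
print(corr.round(2))
```

Indexing the matrix with `corr.loc[row, column]` returns a single coefficient, which is handy when quoting values in the text.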

4.3 Creating combined dataset

In the following I create a combined dataset for the analysis by merging the csb_data_den_haag_measures dataframe with the shapefile. I use the column "neighb_cbs" as the key to merge the dataframes, as I obtained my data from CBS. I save the result in a new dataframe to modularize my notebook and simplify debugging later on.

I noticed that Kraayenstein has a different name in the shapefile and in the geodataframe. Hence, I will change it to make the merge possible.
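Such naming mismatches can be found systematically with a set difference between the two key columns, rather than by eye. A hedged sketch with toy stand-in frames (the column names follow the notebook; the rows are invented):

```python
import pandas as pd

# Toy stand-ins for the shapefile and the CBS measures table
shape_df = pd.DataFrame({'neighb_cbs': ['Kraayenstein', 'Huygenspark', 'Marlot']})
cbs_df = pd.DataFrame({'regio': ['Kraayenstein en Vroondaal', 'Huygenspark', 'Marlot']})

# Keys present in one frame but not the other point at naming mismatches
only_in_shape = set(shape_df['neighb_cbs']) - set(cbs_df['regio'])
only_in_cbs = set(cbs_df['regio']) - set(shape_df['neighb_cbs'])
print(only_in_shape)  # {'Kraayenstein'}
print(only_in_cbs)    # {'Kraayenstein en Vroondaal'}
```

On the real data this immediately lists every neighborhood that would be silently dropped by an inner merge.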

In [ ]:
#replace Kraayenstein with Kraayenstein en Vroondaal, setting inplace = True to change the dataframe directly
den_haag_neighborhoods.replace('Kraayenstein','Kraayenstein en Vroondaal',inplace=True)
#Check if rename was successful
den_haag_neighborhoods[den_haag_neighborhoods['neighb_cbs']=='Kraayenstein en Vroondaal']
Out[ ]:
neighb_cbs neigb_cijf geometry
31 Kraayenstein en Vroondaal 97 Kraayenstein & Vroondaal POLYGON ((4.23657 52.05055, 4.23661 52.05051, ...

I see that the name changed and both dataframes are now ready for the merge.

In [ ]:
#merge the two dataframes using neighb_cbs and regio as keys
den_haag_neighborhoods_with_measures = den_haag_neighborhoods.merge(csb_data_den_haag_measures, left_on='neighb_cbs',right_on='regio')
#Show overview of new merged dataframe
den_haag_neighborhoods_with_measures.info()
<class 'geopandas.geodataframe.GeoDataFrame'>
Int64Index: 111 entries, 0 to 110
Data columns (total 8 columns):
 #   Column                               Non-Null Count  Dtype   
---  ------                               --------------  -----   
 0   neighb_cbs                           111 non-null    object  
 1   neigb_cijf                           111 non-null    object  
 2   geometry                             111 non-null    geometry
 3   Amount of Inhabitants                111 non-null    int64   
 4   Real Estate Value [x 1,000 €]        111 non-null    float64 
 5   Amount of people with low education  111 non-null    float64 
 6   Population density [#/km^2]          111 non-null    float64 
 7   people with low education [%]        111 non-null    float64 
dtypes: float64(4), geometry(1), int64(1), object(2)
memory usage: 7.8+ KB

The merge was successful and I now have a dataset that includes geographical information and the measures I am interested in.
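Whether a merge kept every row can also be verified explicitly with pandas' `indicator=True`, which flags rows found in only one frame. A small sketch with toy data (names hypothetical, not the notebook's frames):

```python
import pandas as pd

left = pd.DataFrame({'neighb_cbs': ['A', 'B', 'C'], 'geometry_id': [1, 2, 3]})
right = pd.DataFrame({'regio': ['A', 'B', 'D'], 'value': [10.0, 20.0, 30.0]})

# An outer merge with indicator=True shows which keys failed to match
check = left.merge(right, left_on='neighb_cbs', right_on='regio',
                   how='outer', indicator=True)
print(check[check['_merge'] != 'both'])
```

An empty result from the last line would confirm that every neighborhood matched; here 'C' and 'D' are reported as unmatched.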

4.4 Choropleths

I want to further analyze the relationship between the percentage of people with low education, the population density, and the real estate value. To identify interesting neighborhoods more specifically, I create multiple choropleths. A choropleth is a thematic map that displays the value of a variable for a given area. Since I need several choropleths, I write a function that creates one for a given measure.

In [ ]:
def create_choroplath_map(data, measure,axis, title):
    """ Creates a choropleth map of the given measure in the given data.
    
        Parameters:
                data (geodataframe_crs): The data to plot
                measure (str): The measure to plot
                axis (axis): The axis to plot on
                title (str): The title of the plot
        Returns:
                None
         """
    #adapt geodataframes to same coordinate system
    den_haag_neighborhoods_with_measures_crs = data.to_crs(epsg=3857)
    den_haag_neighborhoods_crs = den_haag_neighborhoods.to_crs(epsg=3857)
    #plot neighborhoods as base layer showing dropped neighborhoods in grey
    den_haag_neighborhoods_crs.plot(ax=axis, color='grey', edgecolor='black', linewidth=0.5)
    #add layer with choropleth map
    den_haag_neighborhoods_with_measures_crs.plot(column=measure,alpha=0.5,legend=True, legend_kwds={'orientation': "horizontal"},cmap='plasma',ax=axis)
    #disable grid
    axis.grid(False)
    #Set Title to measure name
    axis.set_title(title)
    #disable axis
    axis.set_axis_off()
    #add basemap as background layer
    cx.add_basemap(ax=axis, url=cx.providers.CartoDB.Positron)
    
In [ ]:
# Create 3 subplots and define size
figure, axes = plt.subplots(1,3, figsize=(30, 10))
# Create a list of measures to plot
measures_to_graph = ['Real Estate Value [x 1,000 €]','people with low education [%]','Population density [#/km^2]']
# Loop over the measures and plot them using create_choroplath_map()
for measure,ax in zip(measures_to_graph,axes.ravel()):
    create_choroplath_map(den_haag_neighborhoods_with_measures,measure,ax,measure)

   

The displayed graphs show the values of the different measures across the neighborhoods. Furthermore, I added a base map in the background to facilitate the identification of neighborhoods. One can easily identify the segregation in The Hague. The neighborhoods with the highest real estate value are identified in the next step.

In [ ]:
#figure and subplots for 3 choroplaths
figure, axes = plt.subplots(1,3, figsize=(30, 10))
# Loop over the measures and plot them
for measure,ax in zip(measures_to_graph,axes.ravel()):
    #create a list of the neighborhoods with the 5 highest values  
    highest_values = den_haag_neighborhoods_with_measures.sort_values(by=measure,ascending=False).head(n=5)
    #create choroplath map of the 5 neighborhoods with the highest values
    create_choroplath_map(highest_values,measure,ax, title = measure + ' 5 highest')

In the three graphs I can identify the 5 neighborhoods with the highest values for each measure. For real estate value they tend to be in the north-west, for the percentage of people with low education in the center, and for population density also in the center. For the latter two measures, three of the neighborhoods are clustered in the center in both cases.

In [ ]:
#figure and subplots for 3 choroplaths
figure, axes = plt.subplots(1,3, figsize=(30, 10))
# Loop over the measures and plot them
for measure,ax in zip(measures_to_graph,axes.ravel()):
    #create a list of the neighborhoods with the 5 lowest values  
    lowest_values = den_haag_neighborhoods_with_measures.sort_values(by=measure).head(n=5)
    #create choroplath map of the 5 neighborhoods with the lowest values
    create_choroplath_map(lowest_values,measure,ax, title = measure + ' 5 lowest')

In the three graphs I can identify the neighborhoods with the lowest values for each measure. For real estate value they tend to be in the south, for the percentage of people with low education in the north, and for population density in the east. For population density, three of them are clustered in the east close to the highway; I assume these might be industrial areas.


4. Selection of Neighborhood of Interest

4.1 Local Indicators of Spatial Association (LISA)

To provide further insights, I want to look at the spatial weights of my neighborhoods and at the Local Indicators of Spatial Association (LISA). LISA is a method to identify clusters of similar (high or low) values in a dataset. It is based on spatial weights, which encode which observations count as neighbours of each other. Spatial weights can be constructed in different ways; I use the Queen contiguity method, under which two neighborhoods are neighbours if their polygons share at least one point, i.e., an edge or a corner, analogous to the squares a queen can reach on a chessboard. The LISA method also assigns each neighborhood to one of four quadrants (HH, LL, HL, LH): HH is a high value surrounded by high values, LL a low value surrounded by low values, HL a high value surrounded by low values, and LH a low value surrounded by high values. Especially the HL and LH cases are of interest, as they can be seen as spatial outliers.
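Queen contiguity can be illustrated on a small grid without any geodata: two cells are neighbours if they share an edge or a corner, i.e., their row and column indices each differ by at most 1. A minimal sketch (pure Python, a toy illustration rather than the libpysal implementation used below):

```python
# Queen contiguity on a rows x cols grid: cells are neighbours if they
# share an edge OR a corner (row and column each differ by at most 1)
def queen_neighbors(rows, cols):
    nbrs = {}
    for r in range(rows):
        for c in range(cols):
            cell = r * cols + c
            nbrs[cell] = [rr * cols + cc
                          for rr in range(max(0, r - 1), min(rows, r + 2))
                          for cc in range(max(0, c - 1), min(cols, c + 2))
                          if (rr, cc) != (r, c)]
    return nbrs

w = queen_neighbors(3, 3)
print(len(w[4]))  # centre cell: 8 neighbours
print(len(w[0]))  # corner cell: 3 neighbours
```

Real neighborhood polygons are irregular, but the idea is the same: the weights matrix simply records, per neighborhood, which other neighborhoods touch it.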

In [ ]:
def create_spatial_weights(data, measure):
    """ Creates a spatial weights matrix of the given measure in the given data.

        Parameters:
                data (geodataframe): The data
                measure (string): The measure
        Returns:
                Local Indicators of Spatial Association (LISA) object
    """
    #Create a Queen contiguity spatial weights matrix
    w = weights.Queen.from_dataframe(data)
    #Row-standardize the weights
    w.transform = 'r'
    #Spatial lag of the measure
    data['w_'+measure] = weights.lag_spatial(w, data[measure])
    #Standardize the measure
    data[measure+'_std'] = (data[measure] - data[measure].mean()) / data[measure].std()
    #Spatial lag of the standardized measure
    data['w_'+measure+'_std'] = weights.lag_spatial(w, data[measure+'_std'])
    #Create LISA object
    lisa = esda.Moran_Local(data[measure], w)
    #Significance at the 5% level
    data['significant'] = lisa.p_sim < 0.05
    #Quadrants (1=HH, 2=LH, 3=LL, 4=HL)
    data['quadrant'] = lisa.q
    #return LISA object
    return lisa

I now want to display LISA for the different measures to identify outliers in my data.

In [ ]:
def create_lisa_cluster_graph(lisa,geodataframe,measure):
    '''
    Creates a graph of the LISA clusters
    Parameters:
                lisa (LISA object): The LISA object
                geodataframe (geodataframe): The data
                measure (string): The measure name
    Returns:
        None
    '''
    # Setup the figure and axis
    f, ax = plt.subplots(1, figsize=(7,7))
    # Reproject both layers to web mercator so the basemap aligns
    den_haag_neighborhoods_crs = den_haag_neighborhoods.to_crs(epsg=3857)
    geodataframe_crs = geodataframe.to_crs(epsg=3857)
    # Break observations into significant or not
    geodataframe_crs['significant'] = lisa.p_sim < 0.05
    # Store the quadrant they belong to
    geodataframe_crs['quadrant'] = lisa.q
    #baselayer in white
    den_haag_neighborhoods_crs.plot(ax=ax, color='white', edgecolor='black', linewidth=0.5)
    #plot insignificant clusters in light-grey
    ns = geodataframe_crs.loc[geodataframe_crs['significant']==False, 'geometry']
    ns.plot(ax=ax, color='#dedede')
    # Plot HH clusters in orange
    hh = geodataframe_crs.loc[(geodataframe_crs['quadrant']==1) & (geodataframe_crs['significant']==True), 'geometry']
    hh.plot(ax=ax, color='#eb8134',label='HH')
    # Plot LL clusters in blue
    ll = geodataframe_crs.loc[(geodataframe_crs['quadrant']==3) & (geodataframe_crs['significant']==True), 'geometry']
    ll.plot(ax=ax, color='#6baed6')
    # Plot LH clusters in light-blue
    lh = geodataframe_crs.loc[(geodataframe_crs['quadrant']==2) & (geodataframe_crs['significant']==True), 'geometry']
    lh.plot(ax=ax, color='#b4d5ea')
    # Plot HL clusters in light-orange
    hl = geodataframe_crs.loc[(geodataframe_crs['quadrant']==4) & (geodataframe_crs['significant']==True), 'geometry']
    hl.plot(ax=ax, color='#f2ae7d')

    cx.add_basemap(ax=ax, url=cx.providers.CartoDB.Positron)
    # Style and draw
    ax.set_title('LISA for '+measure)

    #disable axis
    ax.set_axis_off()
    #Show map
    plt.show()
In [ ]:
den_haag_neighborhoods_with_measures_for_lisa = den_haag_neighborhoods_with_measures.copy()
# Create LISA object for people with low education [%] using the create_spatial_weights method passing the data and the measure name
lisa_education = create_spatial_weights(den_haag_neighborhoods_with_measures_for_lisa, 'people with low education [%]')
#Displaying the LISA cluster with my created LISA object
create_lisa_cluster_graph(lisa_education, den_haag_neighborhoods_with_measures_for_lisa,'people with low education [%]')

This graph checks the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying outliers
  • induces the viewer to think about the substance, as together with the background map the viewer can identify the neighborhoods and hypothesize about the reasons for the outliers and clustered areas

I can identify the following regions:

  1. The south-west of the city center, with a high percentage of people with low education across the neighborhoods, shown in orange
  2. The north-east, with a low percentage of people with low education across the neighborhoods, shown in dark blue

In light blue I can see a neighborhood with a low percentage of people with low education surrounded by neighborhoods with a high percentage. This is the neighborhood Huygenspark. In light orange I can see a neighborhood with a high percentage of people with low education surrounded by neighborhoods with a low percentage. This is the neighborhood Rietbuurt. I will keep an eye on both neighborhoods in the following analysis.

In [ ]:
# Create LISA object for Real Estate Value [x 1,000 €] using the create_spatial_weights method passing the data and the measure name
lisa_real_estate_value = create_spatial_weights(den_haag_neighborhoods_with_measures_for_lisa, 'Real Estate Value [x 1,000 €]')
#Displaying the LISA cluster with my created LISA object
create_lisa_cluster_graph(lisa_real_estate_value, den_haag_neighborhoods_with_measures_for_lisa,'Real Estate Value [x 1,000 €]')

This graph checks the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying outliers
  • induces the viewer to think about the substance, as together with the background map the viewer can identify the neighborhoods and hypothesize about the reasons for the outliers and clustered areas

In dark blue I see the areas with a low real estate value surrounded by areas with a low real estate value; this is the city center and the area east of it. In dark orange I see the neighborhoods with a high real estate value surrounded by neighborhoods with a high real estate value. In light blue I see the neighborhoods with a low real estate value surrounded by neighborhoods with a high real estate value. These are:

  • Visserijbuurt
  • Zeeheldenkwartier
  • Haagse Bos

Haagse Bos can be explained by the fact that it is a park. Visserijbuurt and Zeeheldenkwartier are of interest for further analysis. In light orange I see the neighborhoods with a high real estate value surrounded by neighborhoods with a low real estate value. These are:

  • Marlot
  • De Uithof

Marlot can be explained by the fact that Huis ten Bosch, the home of the Dutch Royal family, is located in this neighborhood, which drastically increases the real estate value. De Uithof is of interest for further analysis.

In [ ]:
# Create LISA object for Population density [#/km^2] using the create_spatial_weights method passing the data and the measure name
lisa_population_density = create_spatial_weights(den_haag_neighborhoods_with_measures_for_lisa, 'Population density [#/km^2]')
#Displaying the LISA cluster with my created LISA object
create_lisa_cluster_graph(lisa_population_density, den_haag_neighborhoods_with_measures_for_lisa,'Population density [#/km^2]')

This graph checks the pillars of graphical excellence as it:

  • shows the data
  • serves a clear purpose of identifying outliers
  • induces the viewer to think about the substance, as together with the background map the viewer can identify the neighborhoods and hypothesize about the reasons for the outliers and clustered areas

I see neighborhoods with a high population density surrounded by other neighborhoods with a high population density in the city center. However, there is also a light blue neighborhood, indicating a low population density surrounded by a high population density. This is Zuiderpark, which makes sense as the neighborhood mostly consists of the park of the same name.

4.2 Selection of neighborhoods

To facilitate the identification of the neighborhoods, I used the interactive map of the neighborhoods below.

In [ ]:
#show interactive map of Den Haag
den_haag_neighborhoods.explore()
Out[ ]:
Make this Notebook Trusted to load map: File -> Trust Notebook

Based on my statistical and graphical analysis, I will select the following neighborhoods that should be used in further analysis:

  • Visserijbuurt, because it has a low real estate value but is surrounded by neighborhoods with a high real estate value
  • Zeeheldenkwartier, because it has a low real estate value but is surrounded by neighborhoods with a high real estate value
  • De Uithof, because it has a high real estate value but is surrounded by neighborhoods with a low real estate value
  • Huygenspark, because it has a low percentage of people with low education but is surrounded by neighborhoods with a high percentage
  • Rietbuurt, because it has a high percentage of people with low education but is surrounded by neighborhoods with a low percentage